Utility of Web Content Blocks in Content Extraction

نویسنده

  • Marek Kowalkiewicz
چکیده

In this chapter we discuss the possible application of new concepts in web content extraction: utility assessment, utility annealing, and dynamic aggregated document generation. After analysis of the state of the art in web content extraction, results of a survey study among Polish managers are presented. The discussion covers a web content extraction system with possible extensions that may help tackle the information overload problem. The discussed extensions go beyond current state of the art. Utility assessment considers economical view on value of information, while utility annealing allows for removing content blocks that cover information already acquired from other content blocks. Due to the existing content block extraction technology and new concepts proposed in the chapter, it is possible to dynamically generate aggregated documents.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Expected Utility of Content Blocks in Web Content Extraction

In this paper we discuss the possible application of new concepts in web content extraction: utility assessment, utility annealing, and dynamic aggregated document generation. After analysis of the state of the art in web content extraction, results of a survey study among Polish managers are presented. The discussion covers a web content extraction system with possible extensions that may help...

متن کامل

Identifying Informative Web Content Blocks using Web Page Segmentation

Information Extraction has become an important task for discovering useful knowledge or information from the Web. A crawler system, which gathers the information from the Web, is one of the fundamental necessities of Information Extraction. A search engine uses a crawler to crawl and index web pages. Search engine takes into account only the informative content for indexing. In addition to info...

متن کامل

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

متن کامل

Eliminating Noisy Information from News Websites and Extraction of the News article

A typical web page containing a news article contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). We call these blocks that are not the main content blocks of the page the noisy blocks. In this paper, we propose an algorithm for th...

متن کامل

High Fuzzy Utility Based Frequent Patterns Mining Approach for Mobile Web Services Sequences

Nowadays high fuzzy utility based pattern mining is an emerging topic in data mining. It refers to discover all patterns having a high utility meeting a user-specified minimum high utility threshold. It comprises extracting patterns which are highly accessed in mobile web service sequences. Different from the traditional fuzzy approach, high fuzzy utility mining considers not only counts of mob...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006